Identifying Mislabeled Training Data
Authors
Abstract
This paper presents a new approach to identifying and eliminating mislabeled training instances for supervised learning. The goal of this approach is to improve classification accuracies produced by learning algorithms by improving the quality of the training data. Our approach uses a set of learning algorithms to create classifiers that serve as noise filters for the training data. We evaluate single algorithm, majority vote and consensus filters on five datasets that are prone to labeling errors. Our experiments illustrate that filtering significantly improves classification accuracy for noise levels up to 30%. An analytical and empirical evaluation of the precision of our approach shows that consensus filters are conservative at throwing away good data at the expense of retaining bad data and that majority filters are better at detecting bad data at the expense of throwing away good data. This suggests that for situations in which there is a paucity of data, consensus filters are preferable, whereas majority vote filters are preferable for situations with an abundance of data.
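The filtering idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the learners (`knn`, `nearest_centroid`), the interleaved n-fold split, the function name `filter_noise`, and the toy dataset are all assumptions chosen to keep the example self-contained. Each instance is judged only by classifiers trained on folds that exclude it, and it is flagged when a majority (or, for the consensus filter, all) of the classifiers disagree with its label.

```python
from collections import Counter

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def knn(k):
    """Hypothetical learner factory: fit(train) -> predict(x), voting over k nearest points."""
    def fit(train):
        def predict(x):
            nearest = sorted(train, key=lambda ex: sq_dist(ex[0], x))[:k]
            return Counter(label for _, label in nearest).most_common(1)[0][0]
        return predict
    return fit

def nearest_centroid(train):
    """Hypothetical learner: predict the class whose mean point is closest."""
    groups = {}
    for x, label in train:
        groups.setdefault(label, []).append(x)
    centroids = {label: [sum(col) / len(pts) for col in zip(*pts)]
                 for label, pts in groups.items()}
    def predict(x):
        return min(centroids, key=lambda label: sq_dist(centroids[label], x))
    return predict

def filter_noise(data, learners, n_folds=5, mode="majority"):
    """Return indices of instances flagged as mislabeled.

    Assumption: folds are interleaved (index i belongs to fold i % n_folds).
    mode="consensus" flags an instance only if every classifier misclassifies it;
    mode="majority" flags it if more than half of them do.
    """
    flagged = []
    for fold in range(n_folds):
        # Train every learner on the data outside the current fold.
        train = [ex for i, ex in enumerate(data) if i % n_folds != fold]
        models = [fit(train) for fit in learners]
        # Judge each held-out instance by how many models disagree with its label.
        for i in range(fold, len(data), n_folds):
            x, label = data[i]
            errors = sum(model(x) != label for model in models)
            if mode == "consensus":
                is_noise = errors == len(models)
            else:
                is_noise = errors > len(models) / 2
            if is_noise:
                flagged.append(i)
    return sorted(flagged)

# Toy data (made up for illustration): two well-separated classes,
# with the label of instance 3 deliberately flipped.
data = [((i, 0), "a") for i in range(10)] + [((100 + i, 0), "b") for i in range(10)]
data[3] = (data[3][0], "b")
learners = [knn(1), knn(3), nearest_centroid]

print(filter_noise(data, learners, mode="majority"))
print(filter_noise(data, learners, mode="consensus"))
```

By construction, anything the consensus rule flags is also flagged by the majority rule, which matches the abstract's description of consensus filters as the more conservative choice about discarding data.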
Similar resources
Kernel Based Detection of Mislabeled Training Examples
The problem of identifying mislabeled training examples has been examined in several studies, with a variety of approaches developed for editing the training data to obtain better classifiers. Many of these approaches involve applying an individual or an ensemble of classifiers to the training set and filtering the mislabeled examples based on their consistency with respect to the classifier’s ...
Improving Automated Land Cover Mapping by Identifying and Eliminating Mislabeled Observations from Training Data
This paper presents a new approach to identifying and eliminating mislabeled training samples. The goal of this technique is to decrease the error of classification algorithms by improving the quality of the training data. The approach employs an ensemble of classifiers that serve as a filter for the training data. Using an n-fold cross validation, the training data is passed through the filter...
Identifying the Mislabeled Training Samples of ECG Signals using Machine Learning
The classification accuracy of electrocardiogram signals is often affected by diverse factors, among which mislabeled training samples are one of the most influential problems. In order to mitigate this negative effect, the method of cross validation is introduced to identify the mislabeled samples. The method utilizes the cooperative advantages of different classifiers to act as a filter for t...
Identifying and Eliminating Mislabeled Training Instances
This paper presents a new approach to identifying and eliminating mislabeled training instances. The goal of this technique is to improve classification accuracies produced by learning algorithms by improving the quality of the training data. The approach employs an ensemble of classifiers that serve as a filter for the training data. Using an n-fold cross validation, the training data is passed t...
Boosted Noise Filters for Identifying Mislabeled Data
In many practical classification problems, mislabeled data instances (i.e., class noise) exist in the acquired (training) data and often have a detrimental effect on the classification performance. Identifying such noisy instances and removing them from training data can significantly improve the trained classifiers. One such effective noise detector is the so-called ensemble filter, which pred...
Journal:
- J. Artif. Intell. Res.
Volume 11, Issue
Pages -
Publication date: 1999